Head of Bioinformatics Core, CRUK Cambridge Institute

Outline

  • Why you need to develop computational and analytical skills

  • Opportunities for biologists to undertake bioinformatic analysis

  • Challenges you may face


Three real-life examples of biologists developing bioinformatics skills at the CRUK Cambridge Institute

Data-driven biology

Technological advances have accelerated scientific discovery but place increasing demands on data handling and analysis capabilities.

  • Genomic sequencing
  • Quantitative proteomics
  • High-dimensional flow cytometry
  • Single cell omics


Computational tools and workflows are increasingly designed for use by biologists.

Availability and accessibility of computational tools


Programming languages such as R are now easier to use

  • RStudio
  • Tidyverse packages for data science

Drive for openness, transparency and reproducibility

  • R notebooks for reports containing code and results
  • GitHub for managing source code and version control
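As a sketch of what such a notebook looks like (the title here is invented for illustration), an R Markdown document interleaves narrative text with code chunks whose results appear in the rendered report:

````markdown
---
title: "Example analysis report"
output: html_document
---

Narrative text describing the analysis goes here.

```{r}
# code chunks are run when the report is rendered, and their
# output is embedded alongside the code in the report
summary(cars)
```
````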

Availability and accessibility of computational tools

Web-based platforms

  • User-friendly interfaces
  • Reduce the complexity of running bioinformatics tools
  • Galaxy, GenePattern, BaseSpace, DNAnexus, …

Cloud-based computing

  • Large-scale data processing

Containers

  • Ease packaging and delivery of software
  • Singularity

Combining and manipulating data

The aim is to find the countries with the lowest population densities.


How would you do this?   What steps are involved?

Combining and manipulating data in R

# load tidyverse packages containing the functions we'll need
library(readxl)
library(dplyr)

# read population spreadsheet into R and change column headers
populations <- read_excel("country_data.xlsx", sheet = 1, skip = 1)
colnames(populations) <- c("country", "population")

# read land area spreadsheet into R
areas <- read_excel("country_data.xlsx", sheet = 2)
colnames(areas) <- c("country", "total_area", "water_area", "notes")

# calculate land area
areas <- mutate(areas, land_area = total_area - water_area)

# combine population and area tables
combined_data <- full_join(populations, areas, by = "country")

# calculate population density
combined_data <- mutate(combined_data, density = population / land_area)

Sorting and filtering in R

# sort by population density, filter those with the lowest
# values and select columns for display
combined_data %>%
  arrange(density) %>%
  filter(density < 5.0) %>%
  select(country, population, total_area, water_area, land_area, density, notes)
## # A tibble: 6 x 7
##   country  population total_area water_area land_area density notes                       
##   <chr>         <dbl>      <dbl>      <dbl>     <dbl>   <dbl> <chr>                       
## 1 Mongolia    3027398    1564110      10560   1553550    1.95 <NA>                        
## 2 Namibia     2479713     825615       2425    823190    3.01 <NA>                        
## 3 Austral…   24125848    7692024      58459   7633565    3.16 The largest country in Ocea…
## 4 Iceland      332474     103000       2750    100250    3.32 <NA>                        
## 5 Libya       6293253    1759540          0   1759540    3.58 <NA>                        
## 6 Canada     36289822    9984670     891163   9093507    3.99 Largest country in the West…


The dplyr package provides a useful set of functions for manipulating, combining and filtering tabular data.

Example 1 – Combining and filtering tabular data

Multispectral optoacoustic tomography

Image courtesy of Isabel Quiroz Gonzales

Isabel is combining selected measurements from multiple imaging runs and filtering for interesting results.

Combining and manipulating tabular data

Opportunities


Automation

  • Repetitive and error-prone data handling tasks can be handled efficiently

Reuse

  • Re-run analysis scripts easily on new or updated data

Reproducibility

  • Record of analysis steps carried out


Challenges

  • Learning R can be hard

  • There are many ways of achieving the same thing in R, but which one should you use?


The tidyverse

You know what you want to do, but how do you find the right function to use?


  • The tidyverse is a collection of R packages that "make data science faster, easier and more fun"

Example 2 – Reformatting and visualizing data

Mathilde is monitoring the weight of mice at various timepoints and wants to plot the weight, scaled by the maximum weight of each mouse, by age.

Courtesy of Mathilde Colombe


Example 2 – Reformatting data for analysis and visualization

A common problem – the format used to collect the data is not suitable for analysis and visualization.

We need to reformat the data into a tidy form.

Example 2 – Reformatting data for analysis and visualization

Step 1:  Read the data into R in its current wide format

library(tidyverse)
data <- read_csv("muc2_weight.csv")
data
## # A tibble: 37 x 65
##   ID    `Cage number` Sex   DOB   `Initial weight` `25/09/2017` `02/10/2017` `09/10/2017`
##   <chr> <chr>         <chr> <chr>            <dbl>        <dbl>        <dbl>        <dbl>
## 1 AN17… 13536         M     01/0…             32.5         32.5         32.7         33.6
## 2 AN17… 9381          F     10/0…             28.4         28.4         27.5         29.5
## 3 18/1… 347           F     19/1…             11.4         NA           NA           NA  
## 4 18/1… 347           F     19/1…             12.5         NA           NA           NA  
## 5 18/1… 348           M     19/1…             12.5         NA           NA           NA  
## 6 18/1… 349           M     19/1…             14.1         NA           NA           NA  
## # … with 31 more rows, and 57 more variables: `16/10/2017` <dbl>, `24/10/2017` <dbl>,
## #   `31/10/2017` <dbl>, `07/11/2017` <dbl>, `14/11/2017` <dbl>, `21/11/2017` <dbl>,
## #   `28/11/2017` <dbl>, `05/12/2017` <dbl>, `12/12/2017` <dbl>, `19/12/2017` <dbl>,
## #   `27/12/2017` <dbl>, `02/01/2018` <dbl>, `09/01/2018` <dbl>, `16/01/2018` <dbl>,
## #   `23/01/2018` <dbl>, `30/01/2018` <dbl>, `06/02/2018` <dbl>, `13/02/2018` <dbl>,
## #   `20/02/2018` <dbl>, `27/02/2018` <dbl>, `06/03/2018` <dbl>, `13/03/2018` <dbl>,
## #   `20/03/2018` <dbl>, `27/03/2018` <dbl>, `03/04/2018` <dbl>, `18/04/2018` <dbl>,
## #   `25/04/2018` <dbl>, `02/05/2018` <dbl>, `10/05/2018` <dbl>, `17/05/2018` <dbl>,
## #   `24/05/2018` <dbl>, `30/05/2018` <dbl>, `06/06/2018` <dbl>, `14/06/2018` <dbl>,
## #   `20/06/2018` <dbl>, `27/06/2018` <dbl>, `04/07/2018` <dbl>, `11/07/2018` <dbl>,
## #   `18/07/2018` <dbl>, `25/07/2018` <dbl>, `01/08/2018` <dbl>, `08/08/2018` <dbl>,
## #   `15/08/2018` <dbl>, `22/08/2018` <dbl>, `29/08/2018` <dbl>, `05/09/2018` <dbl>,
## #   `12/09/2018` <dbl>, `19/09/2018` <dbl>, `26/09/2018` <dbl>, `03/10/2018` <dbl>,
## #   `10/10/2018` <dbl>, `17/10/2018` <dbl>, `24/10/2018` <dbl>, `31/10/2018` <dbl>,
## #   `07/11/2018` <dbl>, `14/11/2018` <dbl>, `21/11/2018` <dbl>

Example 2 – Reformatting data for analysis and visualization

Step 2:  Convert from wide format to long (or tidy) format

data <- pivot_longer(data, cols = 6:65, names_to = "Date", values_to = "Weight", values_drop_na = TRUE)
data
## # A tibble: 594 x 7
##   ID              `Cage number` Sex   DOB        `Initial weight` Date       Weight
##   <chr>           <chr>         <chr> <chr>                 <dbl> <chr>       <dbl>
## 1 AN17/16957 (1L) 13536         M     01/05/2017             32.5 25/09/2017   32.5
## 2 AN17/16957 (1L) 13536         M     01/05/2017             32.5 02/10/2017   32.7
## 3 AN17/16957 (1L) 13536         M     01/05/2017             32.5 09/10/2017   33.6
## 4 AN17/16957 (1L) 13536         M     01/05/2017             32.5 16/10/2017   32.3
## 5 AN17/16957 (1L) 13536         M     01/05/2017             32.5 24/10/2017   32.1
## 6 AN17/16957 (1L) 13536         M     01/05/2017             32.5 31/10/2017   33.4
## # … with 588 more rows

The tidyr package provides functions to transform your data into a tidy format.

Example 2 – Data manipulation for visualization

Step 3:  Convert dates and calculate age

library(lubridate)
data <- mutate(data, DOB = dmy(DOB), Date = dmy(Date), Age = Date - DOB)
select(data, ID, DOB, Date, Age, Weight)
## # A tibble: 594 x 5
##   ID              DOB        Date       Age      Weight
##   <chr>           <date>     <date>     <drtn>    <dbl>
## 1 AN17/16957 (1L) 2017-05-01 2017-09-25 147 days   32.5
## 2 AN17/16957 (1L) 2017-05-01 2017-10-02 154 days   32.7
## 3 AN17/16957 (1L) 2017-05-01 2017-10-09 161 days   33.6
## 4 AN17/16957 (1L) 2017-05-01 2017-10-16 168 days   32.3
## 5 AN17/16957 (1L) 2017-05-01 2017-10-24 176 days   32.1
## 6 AN17/16957 (1L) 2017-05-01 2017-10-31 183 days   33.4
## # … with 588 more rows

The dplyr package provides functions for data manipulation that work together in a consistent and coherent manner.

Example 2 – Within group data manipulation

Step 4:  Scale the weights by the maximum recorded weight for each individual

data <- data %>%
  group_by(ID) %>%
  mutate(ScaledWeight = Weight / max(Weight)) %>%
  ungroup()

select(data, ID, DOB, Date, Age, Weight, ScaledWeight)
## # A tibble: 594 x 6
##   ID              DOB        Date       Age      Weight ScaledWeight
##   <chr>           <date>     <date>     <drtn>    <dbl>        <dbl>
## 1 AN17/16957 (1L) 2017-05-01 2017-09-25 147 days   32.5        0.923
## 2 AN17/16957 (1L) 2017-05-01 2017-10-02 154 days   32.7        0.929
## 3 AN17/16957 (1L) 2017-05-01 2017-10-09 161 days   33.6        0.955
## 4 AN17/16957 (1L) 2017-05-01 2017-10-16 168 days   32.3        0.918
## 5 AN17/16957 (1L) 2017-05-01 2017-10-24 176 days   32.1        0.912
## 6 AN17/16957 (1L) 2017-05-01 2017-10-31 183 days   33.4        0.949
## # … with 588 more rows

'%>%' pipes the output of one operation into the next.
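To see what the pipe buys you, the same pipeline can be written without it as nested function calls, which must be read inside-out; this sketch reuses the table from the step above:

```r
# nested form: the innermost call runs first
data <- ungroup(mutate(group_by(data, ID), ScaledWeight = Weight / max(Weight)))

# piped form: each step reads left to right
data <- data %>%
  group_by(ID) %>%
  mutate(ScaledWeight = Weight / max(Weight)) %>%
  ungroup()
```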

Example 2 – Data visualization

Step 5:  Create weight vs age plot

ggplot(data = data, mapping = aes(x = Age, y = ScaledWeight, colour = ID)) +
  geom_line()

Example 2 – Data visualization

Step 5:  Create separate weight vs age plots

ggplot(data = data, mapping = aes(x = Age, y = ScaledWeight, colour = ID)) +
  geom_line() +
  facet_wrap(vars(ID))

Example 2 – Data visualization

Change the representation to a boxplot


ggplot(data = data, mapping = aes(x = ID, y = Weight, colour = Sex)) +
  geom_boxplot()

Data visualization

Opportunities


Exploratory data analysis

  • Plotting is a useful way to explore and understand your data

  • Learning the ggplot2 'grammar of graphics' means you'll be able to create a wide range of plots using a common syntax


Reuse

  • Rerun R scripts to recreate plots for new or updated data

Publication quality graphics


Challenges

  • R has a steep learning curve

  • ggplot2 is relatively straightforward once you get the hang of it, but customizing plots to look exactly as you want can be tricky
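As a small example of the sort of customization involved, the boxplot above could be given axis labels and a cleaner theme; the settings below are illustrative choices, and the unit in the y-axis label is an assumption:

```r
ggplot(data = data, mapping = aes(x = ID, y = Weight, colour = Sex)) +
  geom_boxplot() +
  labs(x = NULL, y = "Weight (g)", colour = "Sex") +  # unit assumed
  theme_bw() +
  # rotate the mouse IDs on the x axis so they don't overlap
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```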


Example 3 – Running an analysis pipeline

Carolin is investigating the relationship between copy number alterations and centrosome instability in ovarian carcinomas and is processing large numbers of samples using shallow whole genome sequencing.

Courtesy of Carolin Sauer

Example 3 – Running an analysis pipeline

Opportunity

The Brenton lab need to be able to run the analysis themselves as and when data are available


Problem

  • Complex, computationally expensive workflow

Solution

  • Develop a robust pipeline based on user requirements, with SOPs for configuring and running it

  • 180+ pipeline runs, 3000+ samples

Example 3 – R notebooks for collaborative research

Running analysis pipelines

Opportunities


Access to large-scale data processing

  • Systematic approach for large cohort studies

  • Biologists don't have to wait for a bioinformatician to process their data and can obtain results sooner


Accelerated pipeline development

  • Bioinformaticians can spend more time on developing new methods/features

Off-the-shelf analysis packages and pipelines


Challenges

  • Unix command line

  • Use of high-performance compute clusters or cloud computing

  • Workflow engines add a level of complexity

  • Troubleshooting failures in pipeline runs is not for the faint-hearted


Conclusions

Learning R will empower you to:

  • explore, analyze and visualize your data more effectively

  • handle repetitive and error-prone tasks efficiently

  • create elegant reports that combine your code, results, plots and narrative text